{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": [
        "MR72r4ByqbI2",
        "Ez0rkhNSqfGt",
        "3jaPDunGrk-8",
        "8th4IbwfsQyn",
        "wU64sRBLs5Em",
        "CfceQVqxtMf_",
        "1s1Lcy0ttV9v"
      ]
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# This notebook demonstrates drift detection for text data with evidently using IMDB reviews and eco-dot reviews.\n",
        "\n",
        "#### In the first part of the notebook we demonstrate using glove vectors and in the later part of the notebook we show how sentence transformers could be utilized.\n",
        "\n",
        "#### We demonstrate the usage with both glove vectors and sentence transformers\n",
        "\n",
        "#### In case of glove vectors, each record is represented as a sentence vector by using glove6B 50d vector (by averaging) for both IMDB and eco-dot reviews datasets. Incase a vector representation is not present for a particular word, we use 50d zero vector.\n",
        "\n",
        "#### In case of sentence transformers we use the MiniLM-v2 model to get the embeddings which is a 384 vector representation. This is "
      ],
      "metadata": {
        "id": "00za-DuIlJMJ",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Install and import the pre-requisites"
      ],
      "metadata": {
        "id": "9Df0HC1el0H3",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8uITYL4VpcNk",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "try:\n",
        "    import evidently\n",
        "except:\n",
        "    get_ipython().system('pip install git+https://github.com/evidentlyai/evidently.git')\n",
        "    # Install sentence transformers\n",
        "    get_ipython().system('pip install sentence-transformers')"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "\n",
        "from sklearn import datasets\n",
        "\n",
        "from evidently import ColumnMapping\n",
        "from evidently.report import Report\n",
        "from evidently.metrics import EmbeddingsDriftMetric\n",
        "from sklearn.model_selection import train_test_split\n",
        "from evidently.metrics.data_drift.embedding_drift_methods import model, distance, ratio, mmd"
      ],
      "metadata": {
        "id": "NgUhVTDKp51n",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Load cleaned Imdb movies review dataset and amazon ecodot review dataset\n",
        "\n",
        "The datasets used in this notebook are from the below resource\n",
        "\n",
        "https://github.com/SangamSwadiK/nlp_example_datasets\n",
        "\n",
        "In case, you do not want to load the glove vectors, and want to use preprocessed vectors and view the drift, please go to https://github.com/SangamSwadiK/nlp_example_datasets and use the rawgithub url's or download it and load them.\n",
        "\n",
        "#### Please go through the below kaggle resources for pre-processing and if you want to acess the original datasets:\n",
        "1) https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews\n",
        "\n",
        "2) https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset"
      ],
      "metadata": {
        "id": "pgQO_xXhSjtm",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# IMBD reviews dataset\n",
        "imdb_5k_data = pd.read_csv(\"https://raw.githubusercontent.com/SangamSwadiK/test_dataset/main/cleaned_imdb_data_5000_rows.csv\")\n",
        "imdb_5k_data.head()"
      ],
      "metadata": {
        "id": "Gn-0gcQ3qI9M",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#This data is oob and has no relationship with movie reviews, you can replace the test set with eco_dot to better understand the workings of drift detection.\n",
        "eco_dot_data = pd.read_csv(\"https://raw.githubusercontent.com/SangamSwadiK/test_dataset/main/eco_data.csv\", squeeze=True)\n",
        "eco_dot_data.head()"
      ],
      "metadata": {
        "id": "7oMiAypxqQK0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "## Run this to experiment with the dataset with various ways of embedding (average over records / sum of records etc ...)\n",
        "!wget http://nlp.stanford.edu/data/glove.6B.zip -P ./\n",
        "!unzip  glove.6B.zip -d ./"
      ],
      "metadata": {
        "id": "UTu00qwVqRcL",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Load glove vector from vector file\n",
        "def load_glove_model(File):\n",
        "  \"\"\" Loads the keyed vectors from a given text file\n",
        "  Args:\n",
        "    File: text file which contains the vectors\n",
        "  Returns:\n",
        "    Dictionary: map containing the key:vector pair\n",
        "  \"\"\"\n",
        "  glove_model = {}\n",
        "  with open(File,'r') as f:\n",
        "      for line in f:\n",
        "          split_line = line.split()\n",
        "          word = split_line[0]\n",
        "          embedding = np.array(split_line[1:], dtype=np.float64)\n",
        "          glove_model[word] = embedding\n",
        "      return glove_model"
      ],
      "metadata": {
        "id": "d9lAgEnjqXrs",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# We load 50 dimension vector here\n",
        "glove_vec = load_glove_model(\"glove.6B.50d.txt\")"
      ],
      "metadata": {
        "id": "vB21brIlqZol",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Train test split the imdb dataset for input data drift comparison"
      ],
      "metadata": {
        "id": "MR72r4ByqbI2",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "## Perform train test split on imdb data\n",
        "train_df, test_df, y_train, y_test = train_test_split(imdb_5k_data.review, imdb_5k_data.sentiment, test_size=0.50, random_state=42)"
      ],
      "metadata": {
        "id": "bDoYMaXaqcfE",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Convert train and test records into embedding vectors"
      ],
      "metadata": {
        "id": "Ez0rkhNSqfGt",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def get_sentence_vector(dataframe):\n",
        "  \"\"\"Get a sentence vector for each text/record by averaging based on counts for each text record\n",
        "  Args:\n",
        "    dataframe: the dataframe containing the text data\n",
        "  returns:\n",
        "    array: a matrix of sentence vectors for each record in the dataframe\n",
        "  \"\"\"\n",
        "  tmp_arr = []\n",
        "  for row in dataframe.values:\n",
        "    tmp = np.zeros(50,)\n",
        "    for word in row:\n",
        "      try:\n",
        "        tmp += glove_vec[word]\n",
        "      except KeyError:\n",
        "        tmp+= np.zeros(50,)\n",
        "    tmp = tmp/len(row.split(\" \"))\n",
        "    tmp_arr.append(tmp.tolist())\n",
        "\n",
        "  return tmp_arr"
      ],
      "metadata": {
        "id": "Xd7H9QgbqgfV",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "train_matrix =  get_sentence_vector(train_df)"
      ],
      "metadata": {
        "id": "bZFIoQCOqgxs",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "train_df_converted = pd.DataFrame(np.array(train_matrix), index = train_df.index)\n",
        "train_df_converted.columns = [\"col_\"+ str(i) for i in range(train_df_converted.shape[1])]\n",
        "train_df_converted.head()"
      ],
      "metadata": {
        "id": "TeU2u_60qiRn",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "## Get the sentence vectors for test dataframe\n",
        "test_matrix =  get_sentence_vector(test_df)"
      ],
      "metadata": {
        "id": "UzhMo1JlqlG3",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "test_df_converted = pd.DataFrame(np.array(test_matrix), index = test_df.index)\n",
        "test_df_converted.columns = [\"col_\"+ str(i) for i in range(test_df_converted.shape[1])]\n",
        "\n",
        "test_df_converted.head()"
      ],
      "metadata": {
        "id": "gmCCsgdFqmWe",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Get sentence vector for echo dot\n",
        "eco_dot_matrix = get_sentence_vector(eco_dot_data)"
      ],
      "metadata": {
        "id": "26EPMtWYqm93",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "ecodot_review_df_converted = pd.DataFrame(np.array(eco_dot_matrix), index = eco_dot_data.index)\n",
        "ecodot_review_df_converted.columns = [\"col_\"+ str(i) for i in range(ecodot_review_df_converted.shape[1])]\n",
        "\n",
        "ecodot_review_df_converted.head()"
      ],
      "metadata": {
        "id": "-26TF_BDqq26",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Embeddings Drift Report\n",
        "\n",
        "Here we take a small subset of the columns and calculate the drift based on it.\n",
        "To understand more about how its being calculated checkout the below link\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embeddings_drift.py#L27"
      ],
      "metadata": {
        "id": "3jaPDunGrk-8",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "column_mapping = ColumnMapping(\n",
        "    embeddings={'small_subset': train_df_converted.columns[:10]}\n",
        ")"
      ],
      "metadata": {
        "id": "jpG44R5MrkSU",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Here we measure drift on the small subset between train and test imdb records\n",
        "report = Report(metrics=[\n",
        "    EmbeddingsDriftMetric('small_subset')\n",
        "])\n",
        "\n",
        "report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000], \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "-q94CajFr34M",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Embeddings Drift Detection: model\n",
        "\n",
        "This approach involves training an SGD Classifier and calculating the ROC AUC score. The drift is measured on the same. If bootstrap and PCA components are enabled, it performs dimensionality reduction with PCA and then performs bootstrap for the ROC AUC and returns the drift result\n",
        "\n",
        "Checkout the below link to understand more on how its calculated\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L99"
      ],
      "metadata": {
        "id": "8th4IbwfsQyn",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = model(\n",
        "                              threshold = 0.55,\n",
        "                              bootstrap = None,\n",
        "                              quantile_probability = 0.95,\n",
        "                              pca_components = None,\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "X5bGhXFssLs0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Embeddings Drift Detection: mmd\n",
        "\n",
        "Here, The drift is calculated using the maximum mean discrepancy test (MMD).\n",
        "If you want to learn more about MMD checkout the below paper\n",
        "https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf\n",
        "\n",
        "If you want to understand how MMD is used as a Multivariate test checkout the below paper\n",
        "\n",
        "https://arxiv.org/pdf/1810.11953.pdf\n",
        "\n",
        "\n",
        "To understand more about the implementation checkout the below\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L201"
      ],
      "metadata": {
        "id": "wU64sRBLs5Em",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = mmd(\n",
        "                              threshold = 0.015,\n",
        "                              bootstrap = None,\n",
        "                              quantile_probability = 0.95,\n",
        "                              pca_components = None,\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "tayoyWEDsXzk",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Embeddings Drift Detection: ratio\n",
        "\n",
        "Here, The drift is calculated based on the ratio of drifted embeddings, we look at each individual embedding and then apply a statistical test which can be picked from the evidently stattests module (https://docs.evidentlyai.com/reference/api-reference/evidently.calculations/evidently.calculations.stattests)\n",
        "\n",
        "In case you want to reduce the dimensionality and use this method, use the pca_components parameter.\n",
        "\n",
        "Checkout the below link to understand more about how this works\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L139"
      ],
      "metadata": {
        "id": "CfceQVqxtMf_",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = ratio(\n",
        "                              component_stattest = 'wasserstein',\n",
        "                              component_stattest_threshold = 0.1,\n",
        "                              threshold = 0.2,\n",
        "                              pca_components = None,\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "szlYuJpDtKwV",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Embeddings Drift detection: Distance\n",
        "\n",
        "Here we use the average distance method for measuring the drift \n",
        "The available distances are euclidean, cosine, cityblock and chebyshev.\n",
        "\n",
        "If bootstrap is enabled, it performs bootstrapping to calculate drift based on quantile probability. If not enabled, it uses the threshold parameter. All values above this threshold means data drift. This only applies when bootstrap != True\n",
        "\n",
        "\n",
        "To understand how this is implemented checkout the below link\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L45"
      ],
      "metadata": {
        "id": "1s1Lcy0ttV9v",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = distance(\n",
        "                              dist = 'euclidean', #\"euclidean\", \"cosine\", \"cityblock\" or \"chebyshev\"\n",
        "                              threshold = 0.2,\n",
        "                              pca_components = None,\n",
        "                              bootstrap = None,\n",
        "                              quantile_probability = 0.95\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = train_df_converted[:500], current_data = test_df_converted[500:1000],  \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "qTfw3GiptTjt",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Drift detection with sentence transformers\n",
        "\n",
        "This demonstrates usage of sentence transformers with evidently for drift detection"
      ],
      "metadata": {
        "id": "76lNDjO8Qnu-",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Convert to embeddings"
      ],
      "metadata": {
        "id": "yW8o8uYZQta_",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# import MiniLM v2 from sentence transformer\n",
        "\n",
        "from sentence_transformers import SentenceTransformer\n",
        "model_miniLM = SentenceTransformer('all-MiniLM-L6-v2')"
      ],
      "metadata": {
        "id": "GtkESF37QuqS",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Encode only a fraction\n",
        "ref_embeddings = model_miniLM.encode(imdb_5k_data[\"review\"][: 100].tolist() )"
      ],
      "metadata": {
        "id": "JtwCZ-GyQvFZ",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "ref_df = pd.DataFrame(ref_embeddings)\n",
        "ref_df.columns = ['col_' + str(x) for x in ref_df.columns]\n",
        "ref_df.head(5)"
      ],
      "metadata": {
        "id": "7aj0e9NgQx5a",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Similarly encode only a fraction\n",
        "cur_embeddings = model_miniLM.encode( eco_dot_data.tolist()[:100] )"
      ],
      "metadata": {
        "id": "V_HkgRLqQz7t",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "cur_df = pd.DataFrame(cur_embeddings)\n",
        "cur_df.columns = ['col_' + str(x) for x in cur_df.columns]\n",
        "cur_df.head(5)"
      ],
      "metadata": {
        "id": "EjKbTK3BQ1Qz",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Embeddings Drift Report\n",
        "\n",
        "\n",
        "Here we take a small subset of the columns and calculate the drift based on it.\n",
        "To understand more about how its being calculated checkout the below link\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embeddings_drift.py#L27"
      ],
      "metadata": {
        "id": "eCceyrQFvI6X",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "column_mapping = ColumnMapping(\n",
        "    embeddings={'small_subset': ref_df.columns[:10]}\n",
        ")"
      ],
      "metadata": {
        "id": "_XgVbxLwtjdu",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics=[\n",
        "    EmbeddingsDriftMetric('small_subset')\n",
        "])\n",
        "\n",
        "report.run(reference_data = ref_df[:50], current_data = cur_df[:50], \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "CSXvPrB_vobV",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Embeddings Drift Detection: model\n",
        "\n",
        "This approach involves training an SGD Classifier and calculating the ROC AUC score. The drift is measured on the same. If bootstrap and PCA components are enabled, it performs dimensionality reduction with PCA and then performs bootstrap for the ROC AUC and returns the drift result\n",
        "\n",
        "Checkout the below link to understand more on how its calculated\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L99"
      ],
      "metadata": {
        "id": "X5J40s1ufWxF",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = model(\n",
        "                              threshold = 0.55,\n",
        "                              bootstrap = None,\n",
        "                              quantile_probability = 0.95,\n",
        "                              pca_components = None,\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = ref_df[:50], current_data = cur_df[:50], \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "bE0AZxEpfWxF",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Embeddings Drift Detection: MMD\n",
        "\n",
        "Here, The drift is calculated using the maximum mean discrepancy test (MMD). If you want to learn more about MMD checkout the below paper https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf\n",
        "\n",
        "If you want to understand how MMD is used as a Multivariate test checkout the below paper\n",
        "\n",
        "https://arxiv.org/pdf/1810.11953.pdf\n",
        "\n",
        "To understand more about the implementation checkout the below\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L201"
      ],
      "metadata": {
        "id": "btLgtFLHexWL",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = mmd(\n",
        "                              threshold = 0.015,\n",
        "                              bootstrap = None,\n",
        "                              quantile_probability = 0.95,\n",
        "                              pca_components = None,\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = ref_df[:50], current_data = ref_df[:50],  \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "kF8bHN2ov_ok",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Embeddings Drift Detection: ratio\n",
        "\n",
        "Here, The drift is calculated based on the ratio of drifted embeddings, we look at each individual embedding and then apply a statistical test which can be picked from the evidently stattests module (https://docs.evidentlyai.com/reference/api-reference/evidently.calculations/evidently.calculations.stattests)\n",
        "\n",
        "In case you want to reduce the dimensionality and use this method, use the pca_components parameter.\n",
        "\n",
        "Checkout the below link to understand more about how this works\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L139"
      ],
      "metadata": {
        "id": "PHnlekfafC1x",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = ratio(\n",
        "                              component_stattest = 'wasserstein',\n",
        "                              component_stattest_threshold = 0.1,\n",
        "                              threshold = 0.2,\n",
        "                              pca_components = None,\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = ref_df[:50], current_data = ref_df[:50],  \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "alr3IRHEwOH0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Embeddings Drift detection: Distance\n",
        "\n",
        "Here we use the average distance method for measuring the drift \n",
        "The available distances are euclidean, cosine, cityblock and chebyshev.\n",
        "\n",
        "If bootstrap is enabled, it performs bootstrapping to calculate drift based on quantile probability. If not enabled, it uses the threshold parameter. All values above this threshold means data drift. This only applies when bootstrap != True\n",
        "\n",
        "\n",
        "To understand how this is implemented checkout the below link\n",
        "\n",
        "https://github.com/evidentlyai/evidently/blob/476a14152799df0c0f0701014b8717d43585fb6b/src/evidently/metrics/data_drift/embedding_drift_methods.py#L45"
      ],
      "metadata": {
        "id": "4GjNdBlkfLFL",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "report = Report(metrics = [\n",
        "    EmbeddingsDriftMetric('small_subset', \n",
        "                          drift_method = distance(\n",
        "                              dist = 'euclidean', #\"euclidean\", \"cosine\", \"cityblock\" or \"chebyshev\"\n",
        "                              threshold = 0.2,\n",
        "                              pca_components = None,\n",
        "                              bootstrap = None,\n",
        "                              quantile_probability = 0.95\n",
        "                          )\n",
        "                         )\n",
        "])\n",
        "\n",
        "report.run(reference_data = ref_df[:50], current_data = ref_df[:50],  \n",
        "           column_mapping = column_mapping)\n",
        "report"
      ],
      "metadata": {
        "id": "xMklGn_lwUyM",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}